Abstract

Mutual information is widely used, in a descriptive way, to measure the 
stochastic dependence of categorical random variables. In order to address 
questions such as the reliability of the descriptive value, one must consider 
sample-to-population inferential approaches. This paper deals with the poste- 
rior distribution of mutual information, as obtained in a Bayesian framework 
by a second-order Dirichlet prior distribution. The exact analytical expression 
for the mean, and analytical approximations for the variance, skewness and 
kurtosis are derived. These approximations have a guaranteed accuracy level 
of the order $O(n^{-3})$, where n is the sample size. Leading order approximations
for the mean and the variance are derived in the case of incomplete samples. 
The derived analytical expressions allow the distribution of mutual informa- 
tion to be approximated reliably and quickly. In fact, the derived expressions 
can be computed with the same order of complexity needed for descriptive mu- 
tual information. This makes the distribution of mutual information become 
a concrete alternative to descriptive mutual information in many applications 
which would benefit from moving to the inductive side. Some of these prospec- 
tive applications are discussed, and one of them, namely feature selection, is 
shown to perform significantly better when inductive mutual information is 
used. 
1 Introduction

Consider a data set of n observations (or units) jointly categorized according to
the random variables $\imath$ and $\jmath$, in $\{1,...,r\}$ and $\{1,...,s\}$, respectively. The observed
counts are $\boldsymbol{n} := (n_{11},...,n_{rs})$, with $n := \sum_{ij} n_{ij}$, and the observed relative
frequencies are $\hat\pi := (\hat\pi_{11},...,\hat\pi_{rs})$, with $\hat\pi_{ij} := n_{ij}/n$. The data $\boldsymbol{n}$ are
considered as a sample from a larger population, characterized by the actual chances
$\pi := (\pi_{11},...,\pi_{rs})$, which are the population counterparts of $\hat\pi$. Both $\pi$ and $\hat\pi$
belong to the rs-dimensional unit simplex.

We consider the statistical problem of analyzing the association between $\imath$ and
$\jmath$, given only the data $\boldsymbol{n}$. This problem is often addressed by measuring indices
of independence, such as the statistical coefficient $\phi^2$ [KS67, pp. 556-561]. In this
paper we focus on the index I called mutual information (also called cross entropy 
or information gain) [Kul68]. This index has gained a growing popularity, especially 
in the artificial intelligence community. It is used, for instance, in learning Bayesian 
networks [CL68, Pea88, Bun96, Hec98], to connect stochastically dependent nodes; 
it is used to infer classification trees [Qui93]. It is also used to select features for 
classification problems [DHS01], i.e. to select a subset of variables by which to predict 
the class variable. This is done in the context of a filter approach that discards 
irrelevant features on the basis of low values of mutual information with the class 
[Lew92, BL97, CHH+02]. 

Mutual information is widely used in a descriptive rather than an inductive way. The
qualifiers 'descriptive' and 'inductive' are used for models bearing on $\hat\pi$ and $\pi$,
respectively. Accordingly, $\hat\pi$ are called relative frequencies, and $\pi$ are called chances.
At the descriptive level, variables $\imath$ and $\jmath$ are found to be either independent or
dependent, according to whether the empirical mutual information $I(\hat\pi)$ is zero or
a positive number. At the inductive level, $\imath$ and $\jmath$ are assessed to be either independent
or dependent only with some probability, because $I(\pi)$ can only be known with some
(second order) probability.

The problem with the descriptive approach is that it neglects the variability of 
the mutual information index with the sample, and this is a potential source of 
fragility of the induced models. In order to achieve robustness, one must move from 
the descriptive to the inductive side. This involves regarding the mutual information 
$I$ as a random variable, with a certain distribution. The distribution allows one to
make reliable, probabilistic statements about $I$.

In order to derive the expression for the distribution of $I$, we work in the framework
of Bayesian statistics. In particular, we use a second order prior distribution
$p(\pi)$ which takes into account our uncertainty about the chances $\pi$. From the
prior $p(\pi)$ and the likelihood we obtain the posterior $p(\pi|\boldsymbol{n})$, of which the posterior
distribution $p(I|\boldsymbol{n})$ of the mutual information is a formal consequence.

Although the problem is formally solved, the task is not accomplished yet. In 
fact, closed-form expressions for the distribution of mutual information are unlikely 
to be available, and we are left with the concrete problem of using the distribution of 
mutual information in practice. We address this problem by providing fast analytical 
approximations to the distribution which have guaranteed levels of accuracy. 

We start by computing the mean and variance of $p(I|\boldsymbol{n})$. This is motivated by
the central limit theorem that ensures that $p(I|\boldsymbol{n})$ can be well approximated by a
Gaussian distribution for large n. Section 2 establishes a general relationship, used
throughout the paper, to relate the mean and variance to the covariance structure
of $p(\pi|\boldsymbol{n})$. By focusing on the specific covariance structure obtained when the prior
over the chances is Dirichlet, we are then led to $O(n^{-2})$ approximations for the
mean and the variance of $p(I|\boldsymbol{n})$. Generalizing the former approach, in Section 3
we report $O(n^{-3})$ approximations for the variance, skewness and kurtosis of $p(I|\boldsymbol{n})$.
We also provide an exact expression for the mean in Section 4, and improved tail 
approximations for extreme quantiles. 

By an example, Section 5 shows that the approximated distributions, obtained 
by fitting some common distributions to the expressions above, compare well to 
the "exact" one obtained by Monte Carlo sampling also for small sample sizes. 
Section 5 also discusses the accuracy of the approximations and their computational 
complexity, which is of the same order of magnitude needed to compute the empirical 
mutual information. This is an important result for the real application of the 
distribution of mutual information. 

In the same spirit of making the results useful for real applications, and considering
that missing data are a pervasive problem of statistical practice, we generalize
the framework to the case of incomplete samples in Section 6. We derive $O(n^{-1})$
expressions for the mean and the variance of $p(I|\boldsymbol{n})$, under the common assumption
that data are missing at random [LR87]. These expressions are in closed form when
observations from one variable, either $\imath$ or $\jmath$, are always present, and their complexity
is the same as in the complete-data case. When observations from both $\imath$ and $\jmath$ can
be missing, there are no closed-form expressions in general, but we show that the
popular expectation-maximization (EM) algorithm [CF74] can be used to compute
$O(n^{-1})$ expressions. This is possible as EM converges to the global optimum for the
problem under consideration, as we show in Section 6.

We stress that the above results are a significant and novel step in the direction
of robustness. To our knowledge, there are only two other works in the literature that
are close to the work presented here. Kleiter has provided approximations to the 
mean and the variance of mutual information by heuristic arguments [Kle99], but 
unfortunately, the approximations are shown to be crude in general (see Section 2). 
Wolpert and Wolf computed the exact mean of mutual information [WW95, Th.10] 
and reported the exact variance as an infinite sum; but the latter does not allow a 
straightforward systematic approximation to be obtained. 
In Section 7 we move from the theoretical to the applied side, discussing the
potential implications of the distribution of mutual information for real applications. 
For illustrative purposes, in the following Section 8, we apply the distribution of 
mutual information to feature selection. We define two new filters based on the 
distribution of mutual information that generalize the traditional filter based on 
empirical mutual information [Lew92]. Several experiments on real data sets show 
that one of the new filters is more effective than the traditional one in the case of 
sequential learning tasks. This is the case for complete data described in Section 
9, as well as incomplete data in Section 10. Concluding remarks are reported in 
Section 11. 



2 Expectation and Variance of Mutual Information

Setup. Consider discrete random variables $\imath \in \{1,...,r\}$ and $\jmath \in \{1,...,s\}$ and an i.i.d.
random process with outcome $(i,j) \in \{1,...,r\} \times \{1,...,s\}$ having joint chance $\pi_{ij}$. The
mutual information is defined by

$$ I(\pi) = \sum_{i=1}^{r}\sum_{j=1}^{s} \pi_{ij} \ln\frac{\pi_{ij}}{\pi_{i+}\pi_{+j}}
 = \sum_{ij} \pi_{ij}\ln\pi_{ij} - \sum_{i} \pi_{i+}\ln\pi_{i+} - \sum_{j} \pi_{+j}\ln\pi_{+j}, \tag{1} $$

where ln denotes the natural logarithm and $\pi_{i+} = \sum_j \pi_{ij}$ and $\pi_{+j} = \sum_i \pi_{ij}$ are marginal
chances. Often the descriptive index $I(\hat\pi) = \sum_{ij} \frac{n_{ij}}{n} \ln\frac{n_{ij}\, n}{n_{i+} n_{+j}}$ is used in place of
the actual mutual information. Unfortunately, the empirical index $I(\hat\pi)$ carries no
information about its accuracy. In particular, $I(\hat\pi) \neq 0$ can have two origins: a true
dependency of the random variables $\imath$ and $\jmath$, or just a fluctuation due to the finite
sample size. In the Bayesian approach to this problem one assumes a prior (second order)
probability density $p(\pi)$ for the unknown chances $\pi_{ij}$ on the probability simplex.
From this one can determine the posterior distribution $p(\pi|\boldsymbol{n}) \propto p(\pi) \prod_{ij} \pi_{ij}^{n_{ij}}$ (the
$n_{ij}$ are multinomially distributed). This allows one to determine the posterior probability
density of the mutual information (see footnote 1):

$$ p(I|\boldsymbol{n}) = \int \delta(I(\pi) - I)\, p(\pi|\boldsymbol{n})\, d^{rs}\pi. \tag{2} $$

The $\delta(\cdot)$ distribution restricts the integral to $\pi$ for which $I(\pi) = I$. Since $0 \le I(\pi) \le I_{\max}$
with sharp upper bound $I_{\max} := \min\{\ln r, \ln s\}$, the domain of $p(I|\boldsymbol{n})$ is $[0, I_{\max}]$,
hence integrals over $I$ may be restricted to this interval of the real line.
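
The descriptive index and its upper bound are straightforward to evaluate from a table of counts. The following is a minimal Python sketch, not code from the paper; the function names and the example table are illustrative assumptions, and cells with zero count are assumed to contribute zero (the usual convention 0 ln 0 = 0).

```python
import math

def empirical_mutual_information(counts):
    """Descriptive index I(pi_hat), in nats, for an r x s table of counts."""
    n = sum(sum(row) for row in counts)
    row_tot = [sum(row) for row in counts]          # n_{i+}
    col_tot = [sum(col) for col in zip(*counts)]    # n_{+j}
    total = 0.0
    for i, row in enumerate(counts):
        for j, n_ij in enumerate(row):
            if n_ij > 0:                            # convention: 0 * ln 0 = 0
                total += n_ij / n * math.log(n_ij * n / (row_tot[i] * col_tot[j]))
    return total

def max_mutual_information(r, s):
    """Sharp upper bound I_max = min(ln r, ln s)."""
    return math.log(min(r, s))

# Hypothetical 2 x 2 table of counts (n_11, n_12, n_21, n_22).
table = [[40, 10], [20, 80]]
print(empirical_mutual_information(table), max_mutual_information(2, 2))
```

The single pass over the r·s cells is what later sections refer to as the cost of computing the descriptive index.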

1 $I(\pi)$ denotes the mutual information for the specific chances $\pi$, whereas $I$ in the context
above is just some non-negative real number. $I$ will also denote the mutual information random
variable in the expectation $E[I]$ and variance $\mathrm{Var}[I]$. Expectations are always w.r.t. the posterior
distribution $p(\pi|\boldsymbol{n})$.
For large sample size, $p(\pi|\boldsymbol{n})$ gets strongly peaked around $\pi = \hat\pi$ and $p(I|\boldsymbol{n})$
gets strongly peaked around the empirical index $I = I(\hat\pi)$. The mean
$E[I] = \int_0^\infty I\, p(I|\boldsymbol{n})\, dI = \int I(\pi)\, p(\pi|\boldsymbol{n})\, d^{rs}\pi$ and the variance
$\mathrm{Var}[I] = E[(I - E[I])^2] = E[I^2] - E[I]^2$ are of central interest.

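General approximation of expectation and variance of I. In the following
we (approximately) relate the mean and variance of $I$ to the covariance structure
of $p(\pi|\boldsymbol{n})$. Let $\bar\pi := (\bar\pi_{11},...,\bar\pi_{rs})$, with $\bar\pi_{ij} := E[\pi_{ij}]$. Since $p(\pi|\boldsymbol{n})$ is strongly peaked
around $\pi = \bar\pi \approx \hat\pi$, for large n we may expand $I(\pi)$ around $\bar\pi$ in the integrals for the
mean and the variance. With $\Delta_{ij} := \pi_{ij} - \bar\pi_{ij} \in [-1,1]$ and using $\sum_{ij}\pi_{ij} = 1 = \sum_{ij}\bar\pi_{ij}$
we get the following expansion of expression (1):

$$ I(\pi) = I(\bar\pi) + \sum_{ij} \ln\frac{\bar\pi_{ij}}{\bar\pi_{i+}\bar\pi_{+j}}\,\Delta_{ij}
 + \sum_{ij}\frac{\Delta_{ij}^2}{2\bar\pi_{ij}} - \sum_{i}\frac{\Delta_{i+}^2}{2\bar\pi_{i+}} - \sum_{j}\frac{\Delta_{+j}^2}{2\bar\pi_{+j}} + O(\Delta^3), \tag{3} $$

where $O(\Delta^3)$ is bounded by the absolute value of (and $\Delta^3$ is equal to) some homogeneous
cubic polynomial in the r·s variables $\Delta_{ij}$. Taking the expectation, the
linear term $E[\Delta_{ij}] = 0$ drops out. The quadratic terms $E[\Delta_{ij}\Delta_{kl}] = \mathrm{Cov}_{(ij)(kl)}[\pi]$ are
the covariance of $\pi$ under $p(\pi|\boldsymbol{n})$ and they are proportional to $n^{-1}$. Equation (9)
in Section 3 shows that $E[\Delta^3] = O(n^{-2})$, whence

$$ E[I] = I(\bar\pi) + \frac{1}{2}\sum_{ijkl}\left(\frac{\delta_{ik}\delta_{jl}}{\bar\pi_{ij}} - \frac{\delta_{ik}}{\bar\pi_{i+}} - \frac{\delta_{jl}}{\bar\pi_{+j}}\right)\mathrm{Cov}_{(ij)(kl)}[\pi] + O(n^{-2}). \tag{4} $$

The Kronecker delta $\delta_{ij}$ is 1 for i = j and 0 otherwise. The variance of $I$ in leading
order in $n^{-1}$ is

$$ \mathrm{Var}[I] = E[(I - E[I])^2] \simeq E\Big[\Big(\sum_{ij}\ln\frac{\bar\pi_{ij}}{\bar\pi_{i+}\bar\pi_{+j}}\,\Delta_{ij}\Big)^2\Big]
 = \sum_{ijkl} \ln\frac{\bar\pi_{ij}}{\bar\pi_{i+}\bar\pi_{+j}}\, \ln\frac{\bar\pi_{kl}}{\bar\pi_{k+}\bar\pi_{+l}}\, \mathrm{Cov}_{(ij)(kl)}[\pi], \tag{5} $$

where $\simeq$ denotes equality up to terms of order $n^{-2}$. So the leading order term for the
variance of mutual information $I(\pi)$, and the leading and second leading order term
for the mean can be expressed in terms of the covariance of $\pi$ under the posterior
distribution $p(\pi|\boldsymbol{n})$.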

The (second order) Dirichlet distribution. Noninformative priors $p(\pi)$ are
commonly used if no explicit prior information is available on $\pi$. Most noninformative
priors lead to a Dirichlet posterior distribution $p(\pi|\boldsymbol{n}) \propto \prod_{ij} \pi_{ij}^{n_{ij}-1}$ with
interpretation (see footnote 2) $n_{ij} = n'_{ij} + n''_{ij}$, where the $n'_{ij}$ are the numbers of
observed outcomes (i,j) and $n''_{ij}$ comprises prior information. Explicit prior knowledge
may also be specified by using virtual units, i.e. by $n''_{ij}$, leading again to a Dirichlet posterior.

2 To avoid unnecessary complications we are abusing the notation: $n_{ij}$ is now the sum of real and
virtual counts, while it formerly denoted the real counts only. In case of Haldane's prior ($n''_{ij} = 0$),
this change is ineffective.

The Dirichlet distribution is defined as follows:

$$ p(\pi|\boldsymbol{n}) = \frac{1}{\mathcal{N}(\boldsymbol{n})} \prod_{ij} \pi_{ij}^{n_{ij}-1}\, \delta(\pi_{++} - 1) \quad\text{with normalization}\quad
 \mathcal{N}(\boldsymbol{n}) = \int \prod_{ij} \pi_{ij}^{n_{ij}-1}\, \delta(\pi_{++} - 1)\, d^{rs}\pi = \frac{\prod_{ij}\Gamma(n_{ij})}{\Gamma(n)}, $$

where $\Gamma$ is the Gamma function. Mean and covariance of $p(\pi|\boldsymbol{n})$ are

$$ \bar\pi_{ij} := E[\pi_{ij}] = \frac{n_{ij}}{n} = \hat\pi_{ij}, \qquad
 \mathrm{Cov}_{(ij)(kl)}[\pi] = \frac{1}{n+1}\left(\hat\pi_{ij}\delta_{ik}\delta_{jl} - \hat\pi_{ij}\hat\pi_{kl}\right). \tag{6} $$

Expectation and variance of I under Dirichlet priors. Inserting (6) into (4)
and (5) we get, after some algebra, the mean and variance of the mutual information
$I(\pi)$ up to terms of order $n^{-2}$:

$$ E[I] \simeq J + \frac{(r-1)(s-1)}{2(n+1)}, \qquad J := \sum_{ij} \frac{n_{ij}}{n} \ln\frac{n_{ij}\, n}{n_{i+} n_{+j}} = I(\hat\pi), \tag{7} $$

$$ \mathrm{Var}[I] \simeq \frac{K - J^2}{n+1}, \qquad K := \sum_{ij} \frac{n_{ij}}{n} \left(\ln\frac{n_{ij}\, n}{n_{i+} n_{+j}}\right)^2. \tag{8} $$

J and K (and L, M, P, Q defined later) depend on $\hat\pi_{ij} = n_{ij}/n$ only, i.e. are O(1) in n.
Strictly speaking, in (7) we should make the expansion $\frac{1}{n+1} = \frac{1}{n} + O(n^{-2})$, i.e. drop
the +1, but the exact expression (6) for the covariance suggests keeping it. We
compared both versions with the "exact" values (from Monte Carlo simulations) for
various parameters $\pi$. In many cases the expansion in $\frac{1}{n+1}$ was more accurate, so
we suggest using this variant.
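
Expressions (7) and (8) need only one pass over the table. The sketch below is an illustration under stated assumptions rather than the authors' implementation: `counts` is assumed to hold the (real plus virtual) counts $n_{ij}$, all strictly positive, and the function name is hypothetical.

```python
import math

def mi_moments_leading_order(counts):
    """Return (mean, variance) of I according to Eqs. (7) and (8)."""
    r, s = len(counts), len(counts[0])
    n = sum(sum(row) for row in counts)
    row_tot = [sum(row) for row in counts]          # n_{i+}
    col_tot = [sum(col) for col in zip(*counts)]    # n_{+j}
    J = K = 0.0
    for i in range(r):
        for j in range(s):
            n_ij = counts[i][j]
            if n_ij > 0:
                log_ratio = math.log(n_ij * n / (row_tot[i] * col_tot[j]))
                J += n_ij / n * log_ratio
                K += n_ij / n * log_ratio ** 2
    mean = J + (r - 1) * (s - 1) / (2 * (n + 1))    # Eq. (7), keeping the +1
    var = (K - J * J) / (n + 1)                     # Eq. (8)
    return mean, var

print(mi_moments_leading_order([[40, 10], [20, 80]]))
```

For a table whose counts factorize exactly into row and column totals, J and K vanish, so the sketch returns the pure correction term for the mean and a zero leading-order variance.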

The first term for the mean is just the descriptive index $I(\hat\pi)$. The second term
is a correction, small when n is much larger than r·s. Kleiter [Kle99] determined the
correction by Monte Carlo studies as $\min\{\frac{r-1}{2n}, \frac{s-1}{2n}\}$. This is only correct if s or r is
2. The expression $2E[I]/n$ he determined for the variance has a completely different
structure than ours. Note that the mean is lower bounded by $\frac{\mathrm{const.}}{n} + O(n^{-2})$, which
is strictly positive for large, but finite sample sizes, even if $\imath$ and $\jmath$ are statistically
independent and independence is perfectly represented in the data ($I(\hat\pi) = 0$). On
the other hand, in this case, the standard deviation $\sigma = \sqrt{\mathrm{Var}[I]} \sim \frac{1}{n} \sim E[I]$ correctly
indicates that the mean is still consistent with zero (where $f \sim g$ means that f and
g have the same accuracy, i.e. $f = O(g)$ and $g = O(f)$).

Our approximations for the mean (7) and variance (8) are good if $\frac{r\cdot s}{n}$ is small. For
dependent random variables, the central limit theorem ensures that $p(I|\boldsymbol{n})$ converges
to a Gaussian distribution with mean $E[I]$ and variance $\mathrm{Var}[I]$. Since $I$ is non-negative,
it is more appropriate to approximate $p(I|\boldsymbol{n})$ as a Gamma (= scaled $\chi^2$)
or a Beta distribution with mean $E[I]$ and variance $\mathrm{Var}[I]$, which are of course also
asymptotically correct.
3 Higher Moments and Orders

A systematic expansion of all moments of $p(I|\boldsymbol{n})$ to arbitrary order in $n^{-1}$ is possible,
but soon gets quite cumbersome. For the mean we give an exact expression in Section
4, so we concentrate here on the variance, skewness and kurtosis of $p(I|\boldsymbol{n})$. The 3rd
and 4th central moments of $\pi$ under the Dirichlet distribution are

$$ E[\Delta_a\Delta_b\Delta_c] = \frac{2}{(n+1)(n+2)}\left[2\hat\pi_a\hat\pi_b\hat\pi_c - \hat\pi_a\hat\pi_b\delta_{bc} - \hat\pi_b\hat\pi_c\delta_{ca} - \hat\pi_c\hat\pi_a\delta_{ab} + \hat\pi_a\delta_{ab}\delta_{bc}\right], \tag{9} $$

$$ \begin{aligned} E[\Delta_a\Delta_b\Delta_c\Delta_d] = \frac{1}{n^2}\big[\, & 3\hat\pi_a\hat\pi_b\hat\pi_c\hat\pi_d
 - \hat\pi_c\hat\pi_d\hat\pi_a\delta_{ab} - \hat\pi_b\hat\pi_d\hat\pi_a\delta_{ac} - \hat\pi_b\hat\pi_c\hat\pi_a\delta_{ad} \\
 & - \hat\pi_a\hat\pi_d\hat\pi_b\delta_{bc} - \hat\pi_a\hat\pi_c\hat\pi_b\delta_{bd} - \hat\pi_a\hat\pi_b\hat\pi_c\delta_{cd} \\
 & + \hat\pi_a\hat\pi_c\delta_{ab}\delta_{cd} + \hat\pi_a\hat\pi_b\delta_{ac}\delta_{bd} + \hat\pi_a\hat\pi_b\delta_{ad}\delta_{bc}\,\big] + O(n^{-3}), \end{aligned} \tag{10} $$

with $a = ij$, $b = kl$, ... $\in \{1,...,r\} \times \{1,...,s\}$ being double indices, $\delta_{ab} = \delta_{ik}\delta_{jl}$, ..., and $\hat\pi_{ij} = n_{ij}/n$.
Expanding $\Delta^k = (\pi - \hat\pi)^k$ in $E[\Delta_a\Delta_b...]$ leads to expressions containing $E[\pi_a\pi_b...]$,
which can be computed by a case analysis of all combinations of equal/unequal
indices a, b, c, ... using (6). Many terms cancel out, leading to the above expressions.
They allow us to compute the order $n^{-2}$ term of the variance of $I(\pi)$. Again,
inspection of (9) suggests expanding in $[(n+1)(n+2)]^{-1}$, rather than in $n^{-2}$. The
leading and second leading order terms of the variance are given below,

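$$ \mathrm{Var}[I] = \frac{K - J^2}{n+1} + \frac{M + (r-1)(s-1)\left(\frac{1}{2} - J\right) - Q}{(n+1)(n+2)} + O(n^{-3}), \tag{11} $$

$$ M := \sum_{ij}\left(\frac{1}{n_{ij}} - \frac{1}{n_{i+}} - \frac{1}{n_{+j}} + \frac{1}{n}\right) n_{ij} \ln\frac{n_{ij}\, n}{n_{i+} n_{+j}}, \tag{12} $$

$$ Q := 1 - \sum_{ij}\frac{n_{ij}^2}{n_{i+} n_{+j}}. \tag{13} $$

J and K are defined in (7) and (8). Note that the first term $\frac{K - J^2}{n+1}$ also contains
second order terms when expanded in $n^{-1}$. The leading order terms for the 3rd and
4th central moments of $p(I|\boldsymbol{n})$ are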



$$ E[(I - E[I])^3] = \frac{2}{n^2}\left[2J^3 - 3KJ + L\right] + \frac{3}{n^2}\left[K + J^2 - P\right] + O(n^{-3}), $$

$$ L := \sum_{ij}\frac{n_{ij}}{n}\left(\ln\frac{n_{ij}\, n}{n_{i+} n_{+j}}\right)^3, \qquad
 P := \sum_{i}\frac{n\,(J_{i+})^2}{n_{i+}} + \sum_{j}\frac{n\,(J_{+j})^2}{n_{+j}}, \qquad
 J_{i+} := \sum_{j}\frac{n_{ij}}{n}\ln\frac{n_{ij}\, n}{n_{i+} n_{+j}}, \quad
 J_{+j} := \sum_{i}\frac{n_{ij}}{n}\ln\frac{n_{ij}\, n}{n_{i+} n_{+j}}, $$

$$ E[(I - E[I])^4] = \frac{3}{n^2}\left[K - J^2\right]^2 + O(n^{-3}), $$

from which the skewness and kurtosis can be obtained by dividing by $\mathrm{Var}[I]^{3/2}$ and
$\mathrm{Var}[I]^2$, respectively. One can see that the skewness is of order $n^{-1/2}$ and the kurtosis
is $3 + O(n^{-1})$. Significant deviation of the skewness from 0 or the kurtosis from 3
would indicate a non-Gaussian $I$. These expressions can be used to get an improved
approximation for $p(I|\boldsymbol{n})$ by making, for instance, an ansatz

$$ p(I|\boldsymbol{n}) \propto (1 + b I + c I^2) \cdot p_0(I|\tilde\mu, \tilde\sigma^2) $$

and fitting the parameters $b$, $c$, $\tilde\mu$, and $\tilde\sigma^2$ to the mean, variance, skewness, and
kurtosis expressions above. $p_0$ is any distribution with Gaussian limit. From this,
quantiles $p(I > I_*|\boldsymbol{n}) := \int_{I_*}^{\infty} p(I|\boldsymbol{n})\, dI$, needed later (and in [Kle99]), can be computed.
A systematic expansion of arbitrarily high moments to arbitrarily high order in $n^{-1}$
leads, in principle, to arbitrarily accurate estimates (assuming convergence of the
expansion).



4 Further Expressions 

Exact value for E[I]. It is possible to get an exact expression for the mean mutual
information $E[I]$ under the Dirichlet distribution. By noting that $x \ln x = \frac{d}{d\beta} x^{\beta}\big|_{\beta=1}$
($x = \{\pi_{ij}, \pi_{i+}, \pi_{+j}\}$), one can replace the logarithms in the last expression of (1) by
powers. From (6) we see that $E[(\pi_{ij})^{\beta}] = \frac{\Gamma(n_{ij}+\beta)\,\Gamma(n)}{\Gamma(n_{ij})\,\Gamma(n+\beta)}$. Taking the derivative and
setting $\beta = 1$ we get

$$ E[\pi_{ij} \ln \pi_{ij}] = \frac{d}{d\beta} E[(\pi_{ij})^{\beta}]_{\beta=1} = \frac{n_{ij}}{n}\left[\psi(n_{ij}+1) - \psi(n+1)\right]. $$

The $\psi$ function has the following properties (see [AS74] for details):

$$ \psi(n) = -\gamma + \sum_{k=1}^{n-1}\frac{1}{k}, \qquad \psi(n + \tfrac{1}{2}) = -\gamma - 2\ln 2 + 2\sum_{k=1}^{n}\frac{1}{2k-1}. \tag{14} $$

The value of the Euler constant $\gamma$ is irrelevant here, since it cancels out. Since the
marginal distributions of $\pi_{i+}$ and $\pi_{+j}$ are also Dirichlet (with parameters $n_{i+}$ and
$n_{+j}$), we get similarly

$$ E[\pi_{i+} \ln \pi_{i+}] = \frac{n_{i+}}{n}\left[\psi(n_{i+}+1) - \psi(n+1)\right], \qquad
 E[\pi_{+j} \ln \pi_{+j}] = \frac{n_{+j}}{n}\left[\psi(n_{+j}+1) - \psi(n+1)\right]. $$

Inserting this into (1) and rearranging terms we get the exact expression

$$ E[I] = \frac{1}{n}\sum_{ij} n_{ij}\left[\psi(n_{ij}+1) - \psi(n_{i+}+1) - \psi(n_{+j}+1) + \psi(n+1)\right]. \tag{15} $$

(This expression has independently been derived in [WW95] in a different way.) For
large sample sizes, $\psi(z+1) \approx \ln z$ and (15) approaches the descriptive index $I(\hat\pi)$ as
it should. Inserting the expansion $\psi(z+1) = \ln z + \frac{1}{2z} + ...$ into (15) we also get the
correction term $\frac{(r-1)(s-1)}{2n}$ of (7).
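
For integer counts (for instance real counts under Haldane's prior), expression (15) can be evaluated with harmonic sums alone, because the Euler constant cancels between the four $\psi$ terms. The following Python sketch makes that assumption explicit; the helper names are hypothetical and non-integer counts would need a general digamma routine instead.

```python
def psi_plus_gamma(m):
    """psi(m) + gamma for a positive integer m: the harmonic sum 1 + 1/2 + ... + 1/(m-1)."""
    return sum(1.0 / k for k in range(1, m))

def exact_mean_mi(counts):
    """Exact posterior mean E[I] of Eq. (15), assuming integer counts n_ij."""
    n = sum(sum(row) for row in counts)
    row_tot = [sum(row) for row in counts]          # n_{i+}
    col_tot = [sum(col) for col in zip(*counts)]    # n_{+j}
    psi_n = psi_plus_gamma(n + 1)
    total = 0.0
    for i, row in enumerate(counts):
        for j, n_ij in enumerate(row):
            # The four psi terms have coefficients +1, -1, -1, +1, so the
            # Euler constant omitted by psi_plus_gamma cancels exactly.
            total += n_ij * (psi_plus_gamma(n_ij + 1)
                             - psi_plus_gamma(row_tot[i] + 1)
                             - psi_plus_gamma(col_tot[j] + 1)
                             + psi_n)
    return total / n

print(exact_mean_mi([[40, 10], [20, 80]]))
```

Recomputing each harmonic sum from scratch keeps the sketch short; a cumulative table of harmonic numbers would avoid the repeated work.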

The presented method (with some refinements) may also be used to determine
an exact expression for the variance of $I(\pi)$. All but one term can be expressed in
terms of Gamma functions. The final result after differentiating w.r.t. $\beta_1$ and $\beta_2$ can
be represented in terms of $\psi$ and its derivative $\psi'$. The mixed term $E[(\pi_{i+})^{\beta_1}(\pi_{+j})^{\beta_2}]$
is more complicated and involves confluent hypergeometric functions, which limits
its practical use [WW95].

Large and small I asymptotics. For extreme quantiles $I_* \approx 0$ or $I_* \approx I_{\max}$, the
accuracy of the approximations derived in the last sections can be poor and it is
better to use tail approximations. In the following we briefly sketch how the scaling
behavior of $p(I|\boldsymbol{n})$ can be determined.

We observe that $I(\pi)$ is small iff $\pi_{ij}$ describes nearly independent random variables
$\imath$ and $\jmath$. This suggests the reparametrization $\pi_{ij} = \tilde\pi_{i+}\tilde\pi_{+j} + \Delta_{ij}$ in the integral (2).
In order to make this representation unique and consistent with $\pi_{++} = 1$, we have to
restrict the r + s + rs degrees of freedom $(\tilde\pi_{i+}, \tilde\pi_{+j}, \Delta_{ij})$ to rs − 1 degrees of freedom
by imposing r + s + 1 constraints, for instance $\sum_i \tilde\pi_{i+} = \sum_j \tilde\pi_{+j} = 1$ and $\Delta_{i+} = \Delta_{+j} = 0$
($\Delta_{++} = 0$ occurs twice). Only small $\Delta$ can lead to small $I(\pi)$. Hence, for small $I$
we may expand $I(\pi)$ in $\Delta$ in expression (2). Inserting $\pi_{ij} = \tilde\pi_{i+}\tilde\pi_{+j} + \Delta_{ij}$ into (3),
we get $I(\tilde\pi_{i+}\tilde\pi_{+j} + \Delta_{ij}) = \Delta^T H(\tilde\pi)\Delta + O(\Delta^3)$ with
$H_{(ij)(kl)} = \frac{1}{2}\left[\delta_{ik}\delta_{jl}/\tilde\pi_{ij} - \delta_{ik}/\tilde\pi_{i+} - \delta_{jl}/\tilde\pi_{+j}\right]$ (cf. (4)) and $H$ and $\Delta$ interpreted as
rs-dimensional matrix and vector. $\Delta^T H(\tilde\pi)\Delta = I$ describes an rs-dimensional ellipsoid
of linear extension $\propto \sqrt{I}$. Due to the r + s − 1 constraints on $\Delta$, the $\Delta$-integration is
actually only over, say, $\Delta_\perp$, and $\Delta_\perp^T H_\perp(\tilde\pi)\Delta_\perp = I$ describes the surface of a
d := (r − 1)(s − 1)-dimensional ellipsoid only. Approximating $p(\pi|\boldsymbol{n})$ by $p(\tilde\pi|\boldsymbol{n})$ in (2),
where $\tilde\pi_{ij} = \tilde\pi_{i+}\tilde\pi_{+j}$, we get

$$ p(I|\boldsymbol{n}) = B(\boldsymbol{n}) \cdot I^{\frac{d}{2}-1} + o(I^{\frac{d}{2}-1}) \quad\text{with}\quad B(\boldsymbol{n}) = \int S_\perp(\tilde\pi)\, p(\tilde\pi|\boldsymbol{n})\, d^{r+s-2}\tilde\pi, $$

where $S_\perp = \frac{\pi^{d/2}}{\Gamma(d/2)\sqrt{\det H_\perp}}$ is the ellipsoid's surface (here $\pi = 3.14...$). Note that $d\tilde\pi$ still
contains a Jacobian from the non-linear coordinate transformation. So the small $I$
asymptotics is $p(I|\boldsymbol{n}) \propto I^{\frac{d}{2}-1}$ (for any prior), but a closed form expression for the
coefficient $B(\boldsymbol{n})$ has yet to be derived.

Similarly we may derive the scaling behavior of $p(I|\boldsymbol{n})$ for $I \to I_{\max} := \min\{\ln r, \ln s\}$.
$I(\pi)$ can be written as $H(\imath) - H(\imath|\jmath)$, where $H$ is the entropy. Without
loss of generality we may assume $r \le s$. $H(\imath) \le \ln r$ with equality iff $\pi_{i+} = \frac{1}{r}$ for all i.
$H(\imath|\jmath) \ge 0$ with equality iff $\imath$ is a deterministic function of $\jmath$. Together, $I(\pi) = I_{\max}$ iff
$\pi_{ij} = \frac{1}{r}\delta_{i,m(j)}\cdot\sigma_j$, where $m: \{1...s\} \to \{1...r\}$ is any onto map and the $\sigma_j \ge 0$ respect the
constraints $\sum_{j \in m^{-1}(i)} \sigma_j = 1$. This suggests the reparametrization $\pi_{ij} = \frac{1}{r}\delta_{i,m(j)}\sigma_j + \Delta_{ij}$
in the integral (2) for each choice of $m()$ and suitable constraints on $\sigma$ and $\Delta$.
5 Numerics

In order to approximate the distribution of mutual information in practice, one needs
to consider implementation issues and the computational complexity of the overall
method. This is what we set out to do in the following.

Computational complexity and accuracy. Regarding computational complexity,
there are short and fast implementations of $\psi$. The code of the Gamma function
in [PFTV92], for instance, can be modified to compute the $\psi$ function. For integer
and half-integer values one may create a lookup table from (14). The needed
quantities J, K, L, M, and Q (depending on $\boldsymbol{n}$) involve a double sum, P only a
single sum, and the r + s quantities $J_{i+}$ and $J_{+j}$ also only a single sum. Hence, the
computation time for the (central) moments is of the same order O(r·s) as for $I(\hat\pi)$.
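
A lookup table of this kind can be sketched as follows. This is an illustrative Python fragment, not the routine of [PFTV92]; the table layout and names are assumptions, and the half-integer entries start from the standard value $\psi(\tfrac{1}{2}) = -\gamma - 2\ln 2$ and use the recurrence $\psi(z+1) = \psi(z) + 1/z$.

```python
import math

EULER_GAMMA = 0.5772156649015329

def build_psi_table(n_max):
    """Tabulate psi(k) for k = 1..n_max and psi(k + 1/2) for k = 0..n_max."""
    psi_int = {}     # psi_int[k]  = psi(k)
    psi_half = {}    # psi_half[k] = psi(k + 1/2)
    acc = -EULER_GAMMA                      # psi(1)
    for k in range(1, n_max + 1):
        psi_int[k] = acc
        acc += 1.0 / k                      # psi(k + 1) = psi(k) + 1/k
    acc = -EULER_GAMMA - 2.0 * math.log(2.0)    # psi(1/2)
    psi_half[0] = acc
    for k in range(1, n_max + 1):
        acc += 2.0 / (2 * k - 1)            # psi(k + 1/2) = psi(k - 1/2) + 1/(k - 1/2)
        psi_half[k] = acc
    return psi_int, psi_half

psi_int, psi_half = build_psi_table(1001)
print(psi_int[1], psi_half[0], psi_half[1])
```

Building the table costs O(n) once, after which every $\psi$ value needed in (15) is a dictionary lookup.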

With respect to the quality of the approximation, let us briefly consider the case
of the variance. The expression for the exact variance has been Taylor-expanded in
$\left(\frac{rs}{n}\right)$, so the relative error $\frac{\mathrm{Var}[I]_{approx} - \mathrm{Var}[I]_{exact}}{\mathrm{Var}[I]_{exact}}$ of the approximation is of the order
$\left(\frac{rs}{n}\right)^2$, if $\imath$ and $\jmath$ are dependent. In the opposite case, the $O(n^{-1})$ term in the sum
drops itself down to order $n^{-2}$, resulting in a reduced relative accuracy $O\left(\frac{rs}{n}\right)$ of
the approximated variance. These results were confirmed by numerical experiments
that we carried out by Monte Carlo simulation to obtain "exact" values of the variance
for representative choices of $\pi_{ij}$, r, s, and n. The approximation for the variance,
together with those for the skewness and kurtosis, and the exact expression for the
mean, allow a good description of the distribution $p(I|\boldsymbol{n})$ to be obtained for not too
small sample bin sizes $n_{ij}$.

We want to conclude with some notes on useful accuracy. The hypothetical prior
sample sizes $n''_{ij} = \{0, \frac{1}{rs}, \frac{1}{2}, 1\}$ can all be argued to be non-informative [GCSR95].
Since the central moments are expansions in $n^{-1}$, the second leading order term
can be freely adjusted by adjusting $n''_{ij} \in [0...1]$. So one may argue that anything
beyond the leading order term is free to will, and the leading order terms may be
regarded as accurate as we can specify our prior knowledge. On the other hand, exact
expressions have the advantage of being safe against cancellations. For instance, the
leading orders of $E[I]$ and $E[I^2]$ do not suffice to compute the leading order term of
$\mathrm{Var}[I]$.

Approximating the distribution. Let us now consider approximating the overall
distribution of mutual information based on the mean and the variance. Fitting a
normal distribution is an obvious possible choice, as the central limit theorem ensures
that $p(I|\boldsymbol{n})$ converges to a Gaussian distribution with mean $E[I]$ and variance $\mathrm{Var}[I]$.
Since $I$ is non-negative, it is also worth considering the approximation of $p(I|\boldsymbol{n})$ by
a Gamma (i.e., a scaled $\chi^2$). Another natural candidate is the Beta distribution,
which is defined for variables in the [0,1] real interval. $I$ can be made such a variable
by a simple normalization. Of course the Gamma and the Beta are asymptotically
correct, too.
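
Fitting either family to a given mean and variance is a matter of moment matching. The sketch below shows one standard way to do this in Python; it is an assumption about how the fit could be carried out, not a description of the procedure actually used for the figure, and the normalization by $I_{\max}$ is the "simple normalization" mentioned above.

```python
import math

def fit_gamma(mean, var):
    """Shape k and scale theta of a Gamma distribution with the given mean and variance."""
    return mean * mean / var, var / mean

def fit_beta(mean, var, i_max):
    """Parameters (a, b) of a Beta distribution for I / i_max with matching moments."""
    m = mean / i_max                     # mean of the normalized variable
    v = var / (i_max * i_max)            # variance of the normalized variable
    common = m * (1.0 - m) / v - 1.0     # equals a + b; must be positive
    if common <= 0:
        raise ValueError("variance too large for a Beta fit")
    return m * common, (1.0 - m) * common

# Hypothetical moments for two binary variables (I_max = ln 2).
k, theta = fit_gamma(0.176, 0.0019)
a, b = fit_beta(0.176, 0.0019, math.log(2))
print(k, theta, a, b)
```

The Gaussian fit needs no extra step, since its two parameters are the mean and the variance themselves.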

We report a graphical comparison of the different approximations by focusing
on the special case of binary random variables, and on three possible vectors of
counts. Figure 1 compares the "exact" distribution of mutual information, computed
via Monte Carlo simulation, with the approximating curves. These curves have
been fitted using the exact mean and the approximated variance of the preceding
section. The figure clearly shows that all the approximations are rather good, with
a slight preference for the Beta approximation. The curves tend to do worse for
smaller sample sizes, as expected. Higher moments may be used to improve the
accuracy (Section 3), or this can be improved using our considerations about tail
approximations in Section 4.

[Figure 1: plot not reproduced. Horizontal axis: I = 0..I_max = log(min(r,s)).]

Figure 1: Distribution of mutual information for two binary random variables. (The
labelling of the horizontal axis is the percentage of $I_{\max}$.) There are three groups of
curves, for different choices of counts $(n_{11}, n_{12}, n_{21}, n_{22})$. The upper group is related
to the vector (40,10,20,80), the intermediate one to the vector (20,5,10,40), and
the lower group to (8,2,4,16). Each group shows the "exact" distribution and three
approximating curves, based on the Gaussian, Gamma and Beta distributions.
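
A Monte Carlo reference of the kind called "exact" above can be sketched by sampling chances from the Dirichlet posterior and evaluating (1) on each draw. The Python fragment below is a minimal illustration under assumptions: the counts are used directly as Dirichlet parameters, Dirichlet draws are generated by normalizing independent Gamma variates, and the function name and sample sizes are arbitrary.

```python
import math
import random

def sample_mi_posterior(counts, num_samples, rng):
    """Draw samples of I(pi) with pi ~ Dirichlet(counts), counts given as an r x s table."""
    r, s = len(counts), len(counts[0])
    flat = [c for row in counts for c in row]
    samples = []
    for _ in range(num_samples):
        g = [rng.gammavariate(a, 1.0) for a in flat]   # normalized Gammas are Dirichlet
        tot = sum(g)
        pi = [[g[i * s + j] / tot for j in range(s)] for i in range(r)]
        row = [sum(pi[i]) for i in range(r)]
        col = [sum(pi[i][j] for i in range(r)) for j in range(s)]
        mi = sum(pi[i][j] * math.log(pi[i][j] / (row[i] * col[j]))
                 for i in range(r) for j in range(s) if pi[i][j] > 0)
        samples.append(mi)
    return samples

rng = random.Random(0)
draws = sample_mi_posterior([[40, 10], [20, 80]], 2000, rng)
print(sum(draws) / len(draws))
```

A histogram of such draws, rescaled to the interval [0, I_max], gives the kind of reference curve against which the fitted Gaussian, Gamma and Beta densities can be compared.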

6 Expressions for Missing Data 

In the following we generalize the setup to include the case of missing data, which
often occurs in practice. We extend the counts $n_{ij}$ to include $n_{?j}$, which counts the
number of instances in which only $\jmath$ is observed (i.e., the number of (?,j) instances),
and the counts $n_{i?}$ for the number of (i,?) instances, where only $\imath$ is observed.

We make the common assumption that the missing data mechanism is ignorable
(missing at random and distinct) [LR87]. The probability distribution of $\jmath$ given
that $\imath$ is missing coincides with the marginal $\pi_{+j}$, and vice versa, as a consequence
of this assumption.

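Setup. The sample size n is now $n_c + n_{+?} + n_{?+}$, where $n_c$ is the number of complete
units. Let $\boldsymbol{n} = (n_{ij}, n_{i?}, n_{?j})$ denote as before the vector of counts, now including the
counts $n_{i?}$ and $n_{?j}$, for all i and j. The probability of a specific data set D, given
$\pi$, hence, is $p(D|\pi, n_c, n_{+?}, n_{?+}) = \prod_{ij}\pi_{ij}^{n_{ij}} \prod_i \pi_{i+}^{n_{i?}} \prod_j \pi_{+j}^{n_{?j}}$. Assuming a uniform prior
$p(\pi) \propto 1\cdot\delta(\pi_{++} - 1)$, Bayes' rule leads to the posterior (which is also the likelihood
in case of uniform prior)

$$ p(\pi|\boldsymbol{n}) = \frac{1}{\mathcal{N}(\boldsymbol{n})} \prod_{ij}\pi_{ij}^{n_{ij}} \prod_i \pi_{i+}^{n_{i?}} \prod_j \pi_{+j}^{n_{?j}}\, \delta(\pi_{++} - 1), $$

where the normalization $\mathcal{N}$ is chosen such that $\int p(\pi|\boldsymbol{n})\, d^{rs}\pi = 1$. With missing data
there is, in general, no closed form expression for $\mathcal{N}$ any more (cf. (6)).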

In the following, we restrict ourselves to a discussion of leading order (in $n^{-1}$)
expressions. In leading order, any Dirichlet prior with $n''_{ij} = O(1)$ leads to the same
results, hence we can simply assume a uniform prior. In leading order, the mean
$E[\pi]$ coincides with the mode of $p(\pi|\boldsymbol{n})$, i.e. the maximum likelihood estimate of
$\pi$. The log-likelihood function $\ln p(\pi|\boldsymbol{n})$ is

$$ L(\pi|\boldsymbol{n}) = \sum_{ij} n_{ij}\ln\pi_{ij} + \sum_i n_{i?}\ln\pi_{i+} + \sum_j n_{?j}\ln\pi_{+j} - \ln\mathcal{N}(\boldsymbol{n}) - \lambda(\pi_{++} - 1), $$

where we have introduced the Lagrange multiplier $\lambda$ to take into account the restriction
$\pi_{++} = 1$. The maximum is at $\frac{\partial L}{\partial\pi_{ij}} = \frac{n_{ij}}{\pi_{ij}} + \frac{n_{i?}}{\pi_{i+}} + \frac{n_{?j}}{\pi_{+j}} - \lambda = 0$. Multiplying this by
$\pi_{ij}$ and summing over i and j we obtain $\lambda = n$. The maximum likelihood estimate
$\hat\pi$ is, hence, given by

$$ \hat\pi_{ij} = \frac{1}{n}\left(n_{ij} + n_{i?}\frac{\hat\pi_{ij}}{\hat\pi_{i+}} + n_{?j}\frac{\hat\pi_{ij}}{\hat\pi_{+j}}\right). \tag{16} $$

This is a non-linear equation in $\hat\pi_{ij}$, which, in general, has no closed form solution.
Nevertheless Eq. (16) can be used to approximate $\hat\pi_{ij}$. Eq. (16) coincides with
the popular expectation-maximization (EM) algorithm [CF74] if one inserts a first
estimate $\hat\pi^0_{ij} = \frac{n_{ij}}{n}$ into the r.h.s. of (16) and then uses the resulting l.h.s. $\hat\pi^1_{ij}$ as a new
estimate, etc. This iteration (quickly) converges to the maximum likelihood solution
(if missing instances are not too frequent). Using this we can compute the leading
order term for the mean of the mutual information (and of any other function of $\pi_{ij}$):
$E[I] = I(\hat\pi) + O(n^{-1})$. The leading order term for the covariance can be obtained
from the second derivative of L.
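
The fixed-point iteration of Eq. (16) is short to write down. The following Python sketch is illustrative only: the argument names, the stopping rule and the iteration cap are assumptions, and all complete counts are assumed to be strictly positive so that no marginal vanishes.

```python
def em_chances(n_ij, n_i_miss, n_miss_j, iters=100, tol=1e-12):
    """Iterate Eq. (16) starting from the first estimate n_ij / n.

    n_ij     : r x s table of complete counts
    n_i_miss : length-r counts with only i observed (n_{i?})
    n_miss_j : length-s counts with only j observed (n_{?j})
    """
    r, s = len(n_ij), len(n_ij[0])
    n = sum(map(sum, n_ij)) + sum(n_i_miss) + sum(n_miss_j)
    pi = [[n_ij[i][j] / n for j in range(s)] for i in range(r)]
    for _ in range(iters):
        row = [sum(pi[i]) for i in range(r)]                        # pi_{i+}
        col = [sum(pi[i][j] for i in range(r)) for j in range(s)]   # pi_{+j}
        new = [[(n_ij[i][j]
                 + n_i_miss[i] * pi[i][j] / row[i]
                 + n_miss_j[j] * pi[i][j] / col[j]) / n
                for j in range(s)] for i in range(r)]
        change = max(abs(new[i][j] - pi[i][j]) for i in range(r) for j in range(s))
        pi = new
        if change < tol:
            break
    return pi

print(em_chances([[40, 10], [20, 80]], [10, 30], [5, 15]))
```

Each pass touches every cell once, so one iteration costs O(rs) operations; only the ratios $\hat\pi_{ij}/\hat\pi_{i+}$ and $\hat\pi_{ij}/\hat\pi_{+j}$ enter the right-hand side, so the unnormalized starting point does no harm.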
Unimodality of p(π|n). The rs×rs Hessian matrix $H \in \mathbb{R}^{rs \cdot rs}$ of $-L$ and the
second derivative in the direction of the rs-dimensional column vector $v \in \mathbb{R}^{rs}$ are

$$ H_{(ij)(kl)}[\pi] := -\frac{\partial^2 L}{\partial\pi_{ij}\,\partial\pi_{kl}} = \frac{n_{ij}}{\pi_{ij}^2}\delta_{ik}\delta_{jl} + \frac{n_{i?}}{\pi_{i+}^2}\delta_{ik} + \frac{n_{?j}}{\pi_{+j}^2}\delta_{jl}, $$

$$ v^T H v = \sum_{ijkl} v_{ij} H_{(ij)(kl)} v_{kl} = \sum_{ij}\frac{n_{ij}}{\pi_{ij}^2} v_{ij}^2 + \sum_i \frac{n_{i?}}{\pi_{i+}^2} v_{i+}^2 + \sum_j \frac{n_{?j}}{\pi_{+j}^2} v_{+j}^2 \ge 0. $$

This shows that $-L$ is a convex function of $\pi$, hence $p(\pi|\boldsymbol{n})$ has a single (possibly
degenerate) global maximum. $-L$ is strictly convex if $n_{ij} > 0$ for all ij, since $v^T H v > 0$
$\forall v \ne 0$ in this case. (Note that positivity of $n_{i?}$ for all i is not sufficient, since $v_{i+} = 0$
for $v \ne 0$ is possible. Actually $v_{++} = 0$.) This implies a unique global maximum,
which is attained in the interior of the probability simplex. Since EM is known to
converge to a local maximum, this shows that in fact EM always converges to the
global maximum.



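Covariance of π. With

$$ A_{(ij)(kl)} := H_{(ij)(kl)}[\hat\pi] = n\left[\frac{\delta_{ik}\delta_{jl}}{\rho_{ij}} + \frac{\delta_{ik}}{\rho_{i?}} + \frac{\delta_{jl}}{\rho_{?j}}\right], \qquad
 \rho_{ij} := n\frac{\hat\pi_{ij}^2}{n_{ij}}, \quad \rho_{i?} := n\frac{\hat\pi_{i+}^2}{n_{i?}}, \quad \rho_{?j} := n\frac{\hat\pi_{+j}^2}{n_{?j}} \tag{17} $$

and $\Delta := \pi - \hat\pi$, we can represent the posterior to leading order as an (rs − 1)-dimensional
Gaussian:

$$ p(\pi|\boldsymbol{n}) \sim e^{-\frac{1}{2}\Delta^T A \Delta}\, \delta(\Delta_{++}). \tag{18} $$

The easiest way to compute the covariance (and other quantities) is to also represent
the $\delta$-function as a narrow Gaussian of width $\varepsilon \approx 0$. Inserting
$\delta(\Delta_{++}) \approx \frac{1}{\varepsilon\sqrt{2\pi}}\exp\left(-\frac{1}{2\varepsilon^2}\Delta^T e e^T \Delta\right)$ into (18), where $e_{ij} = 1$ for all ij (hence $e^T\Delta = \Delta_{++}$), leads
to a full rs-dimensional Gaussian with kernel $\tilde A = A + u v^T$, $u = v = \frac{1}{\varepsilon}e$. The covariance
of a Gaussian with kernel $\tilde A$ is $\tilde A^{-1}$. Using the Sherman-Morrison formula
$\tilde A^{-1} = A^{-1} - A^{-1}\frac{u v^T}{1 + v^T A^{-1} u}A^{-1}$ [PFTV92, p. 73] and $\varepsilon \to 0$ we get

$$ \mathrm{Cov}_{(ij)(kl)}[\pi] := E[\Delta_{ij}\Delta_{kl}] \simeq [\tilde A^{-1}]_{(ij)(kl)} = \left[A^{-1} - \frac{A^{-1} e e^T A^{-1}}{e^T A^{-1} e}\right]_{(ij)(kl)}, \tag{19} $$

where $\simeq$ denotes equality up to terms of order $n^{-2}$. Singular matrices A are easily
avoided by choosing a prior such that $n_{ij} > 0$ for all i and j. A may be inverted
exactly or iteratively, the latter by a trivial inversion of the diagonal part $\delta_{ik}\delta_{jl}/\rho_{ij}$
and by treating $\delta_{ik}/\rho_{i?} + \delta_{jl}/\rho_{?j}$ as a perturbation.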
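
As a quick consistency check of (19), consider the complete-data special case. This worked example is added for illustration and assumes the kernel $A_{(ij)(kl)} = n[\delta_{ik}\delta_{jl}/\rho_{ij} + \delta_{ik}/\rho_{i?} + \delta_{jl}/\rho_{?j}]$ with $\rho_{ij} = n\hat\pi_{ij}^2/n_{ij}$. For $n_{i?} = n_{?j} = 0$ the last two terms of $A$ vanish and $\hat\pi_{ij} = n_{ij}/n = \rho_{ij}$, so that

```latex
\begin{align*}
A_{(ij)(kl)} &= \frac{n}{\hat\pi_{ij}}\,\delta_{ik}\delta_{jl}, \qquad
[A^{-1}]_{(ij)(kl)} = \frac{\hat\pi_{ij}}{n}\,\delta_{ik}\delta_{jl}, \\
[A^{-1}e]_{ij} &= \frac{\hat\pi_{ij}}{n}, \qquad
e^T A^{-1} e = \frac{1}{n}\sum_{ij}\hat\pi_{ij} = \frac{1}{n}, \\
\mathrm{Cov}_{(ij)(kl)}[\pi] &\simeq \frac{\hat\pi_{ij}}{n}\,\delta_{ik}\delta_{jl}
  - \frac{(\hat\pi_{ij}/n)(\hat\pi_{kl}/n)}{1/n}
  = \frac{1}{n}\left(\hat\pi_{ij}\delta_{ik}\delta_{jl} - \hat\pi_{ij}\hat\pi_{kl}\right).
\end{align*}
```

This agrees with the Dirichlet covariance (6) once $n+1$ is replaced by $n$, i.e. up to terms of order $n^{-2}$.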

Missing observations for one variable only. In the case only one variable is
missing, say $n_{?j}=0$, closed form expressions can be obtained. If we sum (16) over $j$
we get $\hat\pi_{i+} = \frac{n_{i+}+n_{i?}}{n}$. Inserting $\hat\pi_{i+} = \frac{n_{i+}+n_{i?}}{n}$ into the r.h.s. of (16)
and solving w.r.t. $\hat\pi_{ij}$, we get the explicit expression

$$\hat\pi_{ij} = \frac{n_{i+}+n_{i?}}{n}\,\frac{n_{ij}}{n_{i+}}. \qquad (20)$$



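Furthermore, it can easily be verified (by multiplication) that
$A_{(ij)(kl)} = n[\delta_{ik}\delta_{jl}/\rho_{ij} + \delta_{ik}/\rho_{i?}]$ has inverse
$[A^{-1}]_{(ij)(kl)} = \frac{1}{n}\bigl[\rho_{ij}\delta_{ik}\delta_{jl} - \frac{\rho_{ij}\rho_{kl}}{\rho_{i+}+\rho_{i?}}\delta_{ik}\bigr]$.
With the abbreviations

$$\tilde Q_{i?} := \frac{\rho_{i?}}{\rho_{i?}+\rho_{i+}} \qquad\text{and}\qquad Q := \sum_i \rho_{i+}\tilde Q_{i?}$$

we get $[A^{-1}e]_{ij} = \sum_{kl}[A^{-1}]_{(ij)(kl)} = \frac{1}{n}\rho_{ij}\tilde Q_{i?}$ and $e^TA^{-1}e = Q/n$.
Inserting everything into (19) we get

$$\mathrm{Cov}_{(ij)(kl)}[\pi] \simeq \frac{1}{n}\left[\rho_{ij}\delta_{ik}\delta_{jl} - \frac{\rho_{ij}\rho_{kl}}{\rho_{i+}+\rho_{i?}}\delta_{ik} - \frac{\rho_{ij}\tilde Q_{i?}\,\rho_{kl}\tilde Q_{k?}}{Q}\right].$$

Inserting this expression for the covariance into (5), using $E[\pi] = \hat\pi + O(n^{-1})$,
we finally get the leading order term in $1/n$ for the variance of mutual information:

$$\mathrm{Var}[I] \simeq \frac{1}{n}\bigl[\tilde K - \tilde J^2/Q - \tilde P\bigr], \qquad \tilde K := \sum_{ij}\rho_{ij}\left(\ln\frac{\hat\pi_{ij}}{\hat\pi_{i+}\hat\pi_{+j}}\right)^2, \qquad (21)$$

$$\tilde P := \sum_i \frac{\tilde J_{i+}^2\,\tilde Q_{i?}}{\rho_{i?}}, \qquad \tilde J := \sum_i \tilde J_{i+}\tilde Q_{i?}, \qquad \tilde J_{i+} := \sum_j \rho_{ij}\ln\frac{\hat\pi_{ij}}{\hat\pi_{i+}\hat\pi_{+j}}.$$

A closed form expression for M(n) also exists. Symmetric expressions for the case
when only $i$ is missing can be obtained. Note that for the complete data case $n_{i?}=0$,
we have $\hat\pi_{ij} = \rho_{ij} = n_{ij}/n$, $\rho_{i?} = \infty$, $\tilde Q_{i?} = Q = 1$, $\tilde J = J$, $\tilde K = K$, and $\tilde P = 0$,
consistent with (8).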

There is at least one reason for having minutely inserted all expressions into
each other and introduced quite a number of definitions. In the presented form all
expressions involve at most a double sum. Hence, the overall time for computing
the mean and variance when only one variable is missing is $O(rs)$.
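
The closed-form expressions (20) and (21) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `mi_one_missing` and its arguments are hypothetical, `n[i][j]` holds the complete counts (all assumed positive), `n_miss[i]` holds the counts $n_{i?}$, and the sample size is taken to be the total number of instances, including the incomplete ones. The mean is approximated by $I(\hat\pi)$.

```python
import math

def mi_one_missing(n, n_miss):
    """Leading-order mean and variance of mutual information when only j is missing.

    n[i][j]   : counts of completely observed instances (assumed > 0)
    n_miss[i] : counts of instances where i is observed and j is missing
    Returns (pi_hat, mean, var) following (20) and (21).
    """
    r, s = len(n), len(n[0])
    total = sum(map(sum, n)) + sum(n_miss)   # assumption: sample size includes incomplete rows
    row = [sum(n[i]) for i in range(r)]
    # Equation (20): explicit posterior mode.
    pi = [[(row[i] + n_miss[i]) / total * n[i][j] / row[i] for j in range(s)]
          for i in range(r)]
    pi_row = [sum(pi[i]) for i in range(r)]
    pi_col = [sum(pi[i][j] for i in range(r)) for j in range(s)]
    log_ratio = [[math.log(pi[i][j] / (pi_row[i] * pi_col[j])) for j in range(s)]
                 for i in range(r)]
    mean = sum(pi[i][j] * log_ratio[i][j] for i in range(r) for j in range(s))
    K = J = P = Q = 0.0
    for i in range(r):
        rho = [total * pi[i][j] ** 2 / n[i][j] for j in range(s)]
        rho_row = sum(rho)
        J_i = sum(rho[j] * log_ratio[i][j] for j in range(s))
        K += sum(rho[j] * log_ratio[i][j] ** 2 for j in range(s))
        if n_miss[i] > 0:
            rho_miss = total * pi_row[i] ** 2 / n_miss[i]
            q_i = rho_miss / (rho_miss + rho_row)
            P += J_i ** 2 / (rho_miss + rho_row)   # equals J_i^2 * q_i / rho_miss
        else:
            q_i = 1.0                              # rho_{i?} is infinite, so P gets no term
        J += J_i * q_i
        Q += rho_row * q_i
    var = (K - J * J / Q - P) / total
    return pi, mean, var
```

Every quantity is a single or double sum over the table, which is where the $O(rs)$ cost comes from. With `n_miss` set to zeros the function reduces to the complete-data leading-order variance $(K - J^2)/n$.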

Expressions for the general case. In the general case when both variables
are missing, each EM iteration (16) for $\hat\pi_{ij}$ needs $O(rs)$ operations. The naive
inversion of $A$ needs time $O((rs)^3)$, and using it to compute $\mathrm{Var}[I]$ time $O((rs)^2)$.
Since the contribution from unlabelled-$i$ instances can be interpreted as a rank $s$
modification of $A$ in the case when $i$ is not missing, one can use Woodbury's
formula $[B+UDV^T]^{-1} = B^{-1} - B^{-1}U[D^{-1}+V^TB^{-1}U]^{-1}V^TB^{-1}$ [PFTV92, p. 75]
with $B_{(ij)(kl)} = \delta_{ik}\delta_{jl}/\rho_{ij} + \delta_{ik}/\rho_{i?}$, $D_{jl} = \delta_{jl}/\rho_{?j}$, and
$U_{(ij)l} = V_{(ij)l} = \delta_{jl}$, to reduce the inversion of the $rs\times rs$ matrix $A$ to the
inversion of a single $s$-dimensional matrix. The result can be written in the form

$$[A^{-1}]_{(ij)(kl)} = \frac{1}{n}\Bigl[F_{ijl}\,\delta_{ik} - \sum_{mn}F_{ijm}[G^{-1}]_{mn}F_{kln}\Bigr], \qquad (22)$$

$$F_{ijl} := \rho_{ij}\delta_{jl} - \frac{\rho_{ij}\rho_{il}}{\rho_{i?}+\rho_{i+}}, \qquad G_{mn} := \rho_{?n}\delta_{mn} + F_{+mn}.$$

The result for the covariance (19) can be inserted into (5) to obtain the leading
order term for the variance:

$$\mathrm{Var}[I] \simeq l^TA^{-1}l - (l^TA^{-1}e)^2/(e^TA^{-1}e) \qquad\text{where}\qquad l_{ij} := \ln\frac{\hat\pi_{ij}}{\hat\pi_{i+}\hat\pi_{+j}}. \qquad (23)$$

Inserting (22) into (23) and rearranging terms appropriately, we can compute $\mathrm{Var}[I]$
in time $O(rs)$ plus the time $O(s^2r)$ to compute the $s\times s$ matrix $G$ and time $O(s^3)$
to invert it, plus the time $O(\#\cdot rs)$ for determining $\hat\pi_{ij}$, where $\#$ is the number of
iterations of EM. Of course, one can and should always choose $s \le r$. Note that
these expressions converge to the exact values when $n$ goes to infinity, irrespective
of the amount of missing data.
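
The structure of (22) can be traced step by step through Woodbury's formula. The following worked block is a sketch that assumes $A = n[B + UDV^T]$ with $B$, $D$, $U$, $V$ as given above, and uses the one-variable inverse (a Sherman-Morrison step on each $i$-block) for $B^{-1}$.

```latex
[B^{-1}]_{(ij)(kl)} = \delta_{ik}F_{ijl},
\qquad
F_{ijl} := \rho_{ij}\delta_{jl} - \frac{\rho_{ij}\rho_{il}}{\rho_{i?}+\rho_{i+}},
\\
[B^{-1}U]_{(ij)m} = \sum_{kl}\delta_{ik}F_{ijl}\,\delta_{lm} = F_{ijm},
\qquad
[V^TB^{-1}U]_{mn} = \sum_{ij}\delta_{jm}F_{ijn} = F_{+mn},
\\
[D^{-1}]_{mn} = \rho_{?n}\delta_{mn}
\;\Rightarrow\;
G_{mn} := [D^{-1} + V^TB^{-1}U]_{mn} = \rho_{?n}\delta_{mn} + F_{+mn},
\\
[A^{-1}]_{(ij)(kl)} = \frac{1}{n}\Bigl[F_{ijl}\,\delta_{ik} - \sum_{mn}F_{ijm}[G^{-1}]_{mn}F_{kln}\Bigr].
```

Only the $s\times s$ matrix $G$ has to be inverted numerically; everything else is available in closed form.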



7 Applications 

The results in the preceding sections provide fast and reliable methods to approx- 
imate the distribution of mutual information from either complete or incomplete 
data. The derived tools have been obtained in the theoretically sound framework of 
Bayesian statistics, which we regard as their basic justification. As these methods 
are available for the first time, it is natural to wonder what their possible uses can be 
on the application side or, stated differently, what can be gained in practice moving 
from descriptive to inductive methods. We believe that the impact on real applica- 
tions can be significant, according to three main scenarios: robust inference methods, 
inferring models that perform well, and fast learning from massive data sets. In the 
following we use classification as a thread to illustrate the above scenarios. Classifi- 
cation is one of the most important techniques for knowledge discovery in databases 
[DHS01]. A classifier is an algorithm that allocates new objects to one out of a finite 
set of previously defined groups (or classes) on the basis of observations on several 
characteristics of the objects, called attributes or features. Classifiers are typically 
learned from data, making explicit the knowledge that is hidden in databases, and 
using this knowledge to make predictions about new data. 

Robust inference methods. An obvious observation is that descriptive methods 
cannot compete, by definition, with inductive ones when robustness is concerned. 
Hence, the results presented in this paper lead naturally to a spin-off for reliable 
methods of inference. 

Let us focus on classification problems, for the sake of explanation. Applying
robust methods to classification means producing classifications that are correct
with a given probability. It is easy to imagine sensitive (e.g., nuclear, medical)
applications where reliability of classification is a critical issue. To achieve reliability,
a necessary step consists in associating a posterior probability (i.e., a guarantee level)
with classification models inferred from data, such as classification trees or Bayesian
nets. Let us consider the case of Bayesian networks. These are graphical models that
represent structures of (in)dependence by directed acyclic graphs, where nodes in the
graph are regarded as random variables [Pea88, Nea04]. Two nodes are connected by
an arc when there is direct stochastic dependence between them. Inferring Bayesian
nets from data is often done by connecting nodes with a significant value of descriptive
mutual information. Little work has been done on robustly inferring Bayesian nets,
probably because of the difficulty of dealing with the distribution of mutual information,
with the notable exception of Kleiter's work [Kle99]. Joining Kleiter's work with
ours might lead to inference of Bayesian network structures that are correct with a
given probability. Some work has already been done in this direction [ZH03].

Feature selection might also benefit from robust methods. Feature selection is 
the problem of reducing the number of feature variables to deal with in classification 
problems. Features can reliably be discarded only when they are irrelevant to the 
class with high probability. This needs knowledge of the distribution of mutual 
information. In Section 8 we propose a filter based on the distribution of mutual 
information to address this problem. 

Inferring models that perform well. It is well-known that model complexity 
must be in proper balance with available data in order to achieve good classification 
accuracy. In fact, unjustified complexity of inferred models leads classifiers almost 
inevitably to overfitting, i.e. to memorize the available sample rather than extracting 
regularities from it that are needed to make useful predictions on new data [DHS01]. 
Overfitting could be avoided by using the distribution of mutual information. With 
Bayesian nets, for example, this could be achieved by drawing arcs between nodes 
only if these are supported by data with high probability. This is a way to impose 
a bias towards simple structures. It has to be verified whether or not this approach 
can systematically lead to better accuracy. 

Model complexity can also be reduced by discarding features. This can be 
achieved by including a feature only when its mutual information with the class 
is significant with high probability. This approach is taken in Section 8, where 
we show that it can effectively lead to better prediction accuracy of the resulting 
models. 

Fast learning from massive data sets. Another very promising application of 
the distribution of mutual information is related to massive data sets. These are 
huge samples, which are becoming more and more available in real applications, and 
which constitute a serious challenge for machine learning and statistical applications. 
With massive data sets it is impractical to scan all the data, so classifiers must be 
reliably inferred by accessing only a small subset of the units. Recent work has 
highlighted [PM03] how inductive methods allow this to be realized. The intuition 
is the following: the inference phase stops reading data when the inferred model, 
say a Bayesian net, has reached a given posterior probability. By choosing such 
probability sufficiently high, one can be arbitrarily confident that the inferred model 
will not change much by reading the neglected data, making the remaining units
superfluous.

8 Feature Selection 

Feature selection is a basic step in the process of building classifiers [BL97, DL97, 
LM98]. In fact, even if theoretically more features should provide one with better 
prediction accuracy (i.e., the relative number of correct predictions), in real cases 
it has been observed many times that this is not the case [KS96] and that it is 
important to discard irrelevant, or weakly relevant features. 

The purpose of this section is to illustrate how the distribution of mutual infor- 
mation can be applied in this framework, according to some of the ideas in Section 
7. Our goal is inferring simple models that avoid overfitting and have an equivalent 
or better accuracy with respect to models that consider all the original features. 

Two major approaches to feature selection are commonly used in machine learn- 
ing [JKP94]: filter and wrapper models. The filter approach is a preprocessing 
step of the classification task. The wrapper model is computationally heavier, as it 
implements a search in the feature space using the prediction accuracy as reward 
measure. In the following we focus our attention on the filter approach: we define 
two new filters and report experimental analysis about them, both with complete 
and incomplete data. 

The proposed filters. We consider the well-known filter (F) that computes the 
empirical mutual information between features and the class, and discards low- 
valued features [Lew92]. This is an easy and effective approach that has gained 
popularity with time. Cheng reports that it is particularly well suited to jointly 
work with Bayesian network classifiers, an approach by which he won the 2001 
international knowledge discovery competition [CHH+02]. The 'Weka' data mining
package implements it as a standard system tool (see [WF99, p. 294]). 
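
For concreteness, the traditional filter F can be sketched as follows. The function names and the representation of each feature as an r x s table of joint counts with the class are assumptions made for this illustration; they are not taken from Weka or from the cited works.

```python
import math

def empirical_mutual_information(counts):
    """Empirical mutual information (in nats) of an r x s table of counts."""
    n = sum(map(sum, counts))
    row = [sum(r) for r in counts]
    col = [sum(c) for c in zip(*counts)]
    mi = 0.0
    for i, r in enumerate(counts):
        for j, n_ij in enumerate(r):
            if n_ij > 0:  # cells with zero count contribute nothing
                mi += n_ij / n * math.log(n_ij * n / (row[i] * col[j]))
    return mi

def filter_f(tables, threshold):
    """Indices of the features whose empirical MI with the class exceeds the threshold."""
    return [k for k, t in enumerate(tables)
            if empirical_mutual_information(t) > threshold]
```

A feature that is independent of the class in the sample gets a value of zero and is discarded, while a feature that determines a balanced binary class gets ln 2.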

A problem with this filter is the variability of the empirical mutual information
with the sample. This may cause wrong judgments of relevance, when those features
are selected for which the mutual information exceeds a fixed threshold $\varepsilon$. In order
for the selection to be robust, we must have some guarantee about the actual value
of mutual information.

We define two new filters. The backward filter (BF) discards an attribute if
$p(I<\varepsilon|n)>p$, where $I$ denotes the mutual information between the feature and the
class, $\varepsilon$ is an arbitrary (low) positive threshold and $p$ is an arbitrary (high)
probability. The forward filter (FF) includes an attribute if $p(I>\varepsilon|n)>p$, with the
same notation.
BF is a conservative filter, along the lines discussed about robustness in Section 7, 
because it will only discard features after observing substantial evidence supporting 
their irrelevance. FF instead will tend to use fewer features (aiming at producing 
classifiers that perform better), i.e. only those for which there is substantial evidence 
about them being useful in predicting the class. 
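
The two decision rules can be sketched as follows. Note the assumptions: the posterior probability is computed here with a Gaussian approximation built from the posterior mean and variance, chosen only to keep the sketch within the Python standard library, whereas the experiments reported below use the Beta approximation; the function names are hypothetical.

```python
from statistics import NormalDist

def prob_mi_above(eps, mean, var):
    """Gaussian approximation of p(I > eps | n) from the posterior mean and variance."""
    return 1.0 - NormalDist(mean, var ** 0.5).cdf(eps)

def forward_filter_includes(eps, p, mean, var):
    """FF: include the feature only if p(I > eps | n) > p."""
    return prob_mi_above(eps, mean, var) > p

def backward_filter_discards(eps, p, mean, var):
    """BF: discard the feature only if p(I < eps | n) > p."""
    return 1.0 - prob_mi_above(eps, mean, var) > p
```

A feature whose posterior mean lies just above the threshold but whose variance is still large is neither included by FF nor discarded by BF, which is exactly where the two filters differ.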
The next sections present experimental comparisons of the new filters and the
original filter F.

Experimental methodology. For the following experiments we use the naive
Bayes classifier [DH73]. This is a good classification model, despite its simplifying
assumptions [DP97], which often competes successfully with much more complex
classifiers from the machine learning field, such as C4.5 [Qui93]. The experiments
focus on the incremental use of the naive Bayes classifier, a natural learning process 
when the data are available sequentially: the data set is read instance by instance; 
each time, the chosen filter selects a subset of attributes that the naive Bayes uses 
to classify the new instance; the naive Bayes then updates its knowledge by taking 
into consideration the new instance and its actual class. The incremental approach 
allows us to better highlight the different behaviors of the empirical filter (F) and 
those based on the distribution of mutual information (BF and FF). In fact, for 
increasing sizes of the learning set the filters converge to the same behavior. 
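
The incremental protocol described above can be sketched as a classify-then-update loop. This is only an illustration under assumptions: `prequential_accuracy` and the `select_features` callback (which stands in for F, BF or FF and receives the current counts) are hypothetical names, and the uniform prior is realized as add-one smoothing over the classes and feature values seen so far.

```python
import math
from collections import defaultdict

def prequential_accuracy(instances, labels, select_features):
    """Classify each instance before learning from it; return the running accuracy."""
    n_feat = len(instances[0])
    class_counts = defaultdict(int)                     # n_c
    joint = [defaultdict(int) for _ in range(n_feat)]   # joint[f][(value, c)]
    seen = [set() for _ in range(n_feat)]               # values observed for feature f
    correct, accuracy = 0, []
    for t, (x, y) in enumerate(zip(instances, labels)):
        active = list(select_features(joint, class_counts))  # filter step
        best, best_score = None, -math.inf
        for c, n_c in class_counts.items():
            # Add-one smoothing for the class prior and for each feature likelihood.
            score = math.log((n_c + 1) / (t + len(class_counts)))
            for f in active:
                k = len(seen[f] | {x[f]})
                score += math.log((joint[f][(x[f], c)] + 1) / (n_c + k))
            if score > best_score:
                best, best_score = c, score
        correct += (best == y)        # the very first instance cannot be classified
        accuracy.append(correct / (t + 1))
        # Learn from the instance only after it has been classified.
        class_counts[y] += 1
        for f in range(n_feat):
            joint[f][(x[f], y)] += 1
            seen[f].add(x[f])
    return accuracy
```

Passing `lambda joint, class_counts: range(n_features)` as the callback reproduces the naive Bayes classifier without any filter; a real filter would inspect the counts and return a subset of feature indices.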

For each filter, we are interested in experimentally evaluating two quantities: for 
each instance of the data set, the average number of correct predictions (namely, 
the prediction accuracy) of the naive Bayes classifier up to such instance; and the 
average number of attributes used. By these quantities we can compare the filters 
and judge their effectiveness. 

The implementation details for the following experiments include: using the Beta
approximation (Section 5) to the distribution of mutual information, with the exact
mean (15) and the $O(n^{-3})$-approximation of the variance, given in (11); using the
uniform prior for the naive Bayes classifier and all the filters; and setting the level
$p$ for the posterior probability to 0.95. As far as $\varepsilon$ is concerned, we cannot set it
to zero because the probability that two variables are independent ($I=0$) is zero
according to the inferential Bayesian approach. We can interpret the parameter $\varepsilon$ as
a degree of dependency strength below which attributes are deemed irrelevant. We
set $\varepsilon$ to 0.003, in the attempt to discard only attributes with negligible impact on
predictions. As we will see, such a low threshold can nevertheless lead to discarding
many attributes.

9 Experimental analysis with complete samples

Table 1 lists ten data sets used in the experiments for complete data. These are 
real data sets on a number of different domains. For example, Shuttle-small reports 
data on diagnosing failures of the space shuttle; Lymphography and Hypothyroid 
are medical data sets; Spam is a body of e-mails that can be spam or non-spam; 
etc. 

The data sets presenting non-categorical features have been pre-discretized by
MLC++ [KJL+94] with default options, i.e. by the common entropy-based discretization
[FI93]. This step may remove some attributes, judging them as irrelevant, so the
number of features in the table refers to the data sets after the possible discretization.

Name             #feat.   #inst.   mode freq.
Crx                  15      653        0.547
German-org           17     1000        0.700
Hypothyroid          23     2238        0.942
Led24                24     3200        0.105
Lymphography         18      148        0.547
Shuttle-small         8     5800        0.787
Spam              21611     1101        0.563
Vote                 16      435        0.614

Table 1: Complete data sets used in the experiments, together with their number of
features, of instances and the relative frequency of the mode. All but the Spam
data sets are available from the UCI repository of machine learning data sets [MA95].
The Spam data set is described in [AKC+00] and available from Androutsopoulos's
web page.



The instances with missing values have been discarded, and the third column in the 
table refers to the data sets without missing values. Finally, the instances have been 
randomly sorted before starting the experiments. 

Results. In short, the results show that FF outperforms the commonly used filter
F, which in turn outperforms the filter BF. FF leads either to the same prediction
accuracy as F or to a better one, using substantially fewer attributes most of the
time. The same holds for F versus BF.

In particular, we used the two-tailed paired t-test at level 0.05 to compare the
prediction accuracies of the naive Bayes with different filters, in the first k instances
of the data set, for each k.

The results in Table 2 show that, although the number of attributes used often
differs substantially, both the differences between FF and F, and the differences
between F and BF, were never statistically significant on eight data sets out of ten.



Figure 2: Comparison of the prediction accuracies of the naive Bayes with filters F
and FF on the Chess data set. The gray area denotes differences that are not
statistically significant.

Data set         #feat.       FF        F        BF
Australian           36     32.6     34.3      35.9
Chess*               36     12.6     18.1      26.1
Crx                  15     11.9     13.2      15.0
German-org           17      5.1      8.8      15.2
Hypothyroid          23      4.8      8.4      17.1
Led24                24     13.6     14.0      24.0
Lymphography         18     18.0     18.0      18.0
Shuttle-small         8      7.1      7.7       8.0
Spam*             21611    123.1    822.0   13127.4
Vote                 16     14.0     15.2      16.0

Table 2: Average number of attributes selected by the filters on the entire data set,
reported in the last three columns. (Refer to the Section 'The proposed filters' for
the definition of the filters.) The second column from left reports the original number
of features. In all but one case, FF selected fewer features than F, sometimes far
fewer; F usually selected far fewer features than BF, which was very conservative.
Names marked with an asterisk refer to data sets on which prediction accuracies
were significantly different.

The remaining cases are described by means of the following figures. Figure 2
shows that FF allowed the naive Bayes to make significantly better predictions than F
for the greatest part of the Chess data set. The maximum difference in prediction
accuracy is obtained at instance 422, where the accuracies are 0.889 and 0.832 for
the cases FF and F, respectively. Figure 2 does not report the BF case, because
there is no significant difference with the F curve. The good performance of FF was
obtained using only about one third of the attributes (Table 2).




Figure 3: Prediction accuracies of the naive Bayes with filters F, FF and BF on the
Spam data set, plotted against the instance number. The differences between F and FF
are significant in the range of observations 32-413. The differences between F and BF
are significant from observation 65 to the end (this significance is not displayed in
the picture).

Figure 3 compares the accuracies on the Spam data set. The difference between
the cases FF and F is significant in the range of instances 32-413, with a maximum
at instance 59 where accuracies are 0.797 and 0.559 for FF and F, respectively. BF
is significantly worse than F from instance 65 to the end. This excellent performance
of FF is even more valuable considering the very low number of attributes selected
for classification. In the Spam case, attributes are binary and correspond to the
presence or absence of words in an e-mail, and the goal is to decide whether or not
the e-mail is spam. All the 21611 words found in the body of e-mails were initially
considered. FF shows that only an average of about 123 relevant words is needed to
make good predictions. Worse predictions are made using F and BF, which select,
on average, about 822 and 13127 words, respectively. Figure 4 shows the average
number of excluded features for the three filters on the Spam data set. FF quickly
discards most of the features, and keeps the number of selected features almost
constant throughout the process. The remaining filters tend to such a number, with
different speeds, after initially including many more features than FF.

In summary, the experimental evidence supports the strategy of only using the 
features that are reliably judged to carry useful information to predict the class, pro- 
vided that the judgment can be updated as soon as new observations are collected. 
FF almost always selects fewer features than F, leading to a prediction accuracy at 
least as good as the one F leads to. The comparison between F and BF is analogous, 
so FF appears to be the best filter and BF the worst. This is not surprising as BF 
was designed to be conservative and was used here just as a term of comparison. 
The natural use of BF is for robust classification when it is important not to discard 
features potentially relevant to predict the class. 



10 Experimental analysis with incomplete samples

This section presents an experimental analysis on incomplete data along the lines of
the preceding experiments. The new data sets are listed in Table 3.

Name               #feat.   #inst.   #m.d.   mode freq.
Audiology              69      226     317        0.212
Crx                    15      690      67        0.555
Horse-colic            18      368    1281        0.630
Hypothyroidloss        23     3163    1980        0.952
Soybean-large          35      683    2337        0.135

Table 3: Incomplete data sets used for the new experiments, together with their
number of features, instances, missing values, and the relative frequency of the
mode. The data sets are available from the UCI repository of machine learning data
sets [MA95].

The filters F and FF are defined as before. However, now the mean and variance
of mutual information are obtained by using the results in Section 6, in particular
the closed-form expressions for the case when only one variable is missing. In fact, in
the present data sets the class is never missing, as is quite common in classification
tasks. We remark that the mean is simply approximated now as $I(\hat\pi)$, where $\hat\pi$ is
given by (20), whereas the variance is reported in (21). Furthermore, note that also
the traditional filter F, as well as the naive Bayes classifier, are now computed using
the empirical probabilities (20). The remaining implementation details are as in the
case of complete data.



Data set           #feat.     FF      F     BF
Audiology              69   64.3   68.0   68.7
Crx                    15    9.7   12.6   13.8
Horse-colic            18   11.8   16.1   17.4
Hypothyroidloss        23    4.3    8.3   13.2
Soybean-large          35   34.2   35.0   35.0

Table 4: Average number of attributes selected by the filters on the entire data set,
reported in the last three columns. The second column from left reports the original
number of features. FF always selected fewer features than F; F almost always
selected fewer features than BF. Prediction accuracies were significantly different
for the Hypothyroidloss data set.



The results in Table 4 show that the filters behave very similarly to the case of 
complete data. The filter FF still selects the smallest number of features, and this 
number usually increases with F and even more with BF. The selection can be very 
pronounced, as with the Hypothyroidloss data set. This is also the only data set for 
which the prediction accuracies of F and FF are significantly different, in favor of 
FF. This is better highlighted by Figure 5. 



Figure 5: Prediction accuracies of the naive Bayes with filters F and FF on the
Hypothyroidloss data set. (BF is not reported because there is no significant
difference with the F curve.) The differences between F and FF are significant in the
range of observations 71-374. The maximum difference is achieved at observation 71,
where the accuracies are 0.986 (FF) vs. 0.930 (F).

Remark. The most prominent evidence from the experiments is the better
performance of FF versus the traditional filter F. In this note we look at FF from
another perspective to exemplify and explain its behavior.

FF includes an attribute if $p(I>\varepsilon|n)>p$, according to its definition. Let us
assume that FF is realized by means of the Gaussian rather than the Beta
approximation (as in the experiments above), and let us choose $p \approx 0.977$. The
condition $p(I>\varepsilon|n)>p$ becomes $\varepsilon < E[I] - 2\sqrt{\mathrm{Var}[I]}$, or, in an approximate way,
$I(\hat\pi) > \varepsilon + 2\sqrt{\mathrm{Var}[I]}$, given that $I(\hat\pi)$ is the first-order approximation of $E[I]$ (cf.
(4)). We can regard $\varepsilon + 2\sqrt{\mathrm{Var}[I]}$ as a new threshold $\varepsilon'$. Under this interpretation,
we see that FF is approximately equal to using the filter F with the bigger threshold
$\varepsilon'$. This interpretation also makes it clearer why FF can be better suited than F
for sequential learning tasks. In sequential learning, $\mathrm{Var}[I]$ decreases as new units
are read; this makes $\varepsilon'$ a self-adapting threshold that adjusts the level of caution
(in including features) as more units are read. In the limit, $\varepsilon'$ is equal to $\varepsilon$. This
characteristic of self-adaptation, which is absent in F, seems to be decisive to the
success of FF.
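
The self-adapting threshold of the remark can be illustrated with a short sketch. The function name `adaptive_threshold` is hypothetical, and the variance values decaying like c/n are made-up numbers chosen only to show the trend; the factor 2 is recovered as the 0.977 quantile of the standard Gaussian.

```python
from statistics import NormalDist

def adaptive_threshold(eps, var, p=0.977):
    """Threshold eps' such that I(pi_hat) > eps' approximates p(I > eps | n) > p."""
    z = NormalDist().inv_cdf(p)   # close to 2 for p = 0.977
    return eps + z * var ** 0.5

# Hypothetical variances that shrink like c/n as more units are read.
for n in (100, 1000, 10000):
    print(n, adaptive_threshold(0.003, 0.01 / n))
```

The printed thresholds decrease towards 0.003 as n grows, so FF is cautious early on and behaves like F with threshold 0.003 in the limit.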

11 Conclusions 

This paper has provided fast and reliable analytical approximations for the variance,
skewness and kurtosis of the posterior distribution of mutual information, with
guaranteed accuracy from $O(n^{-1})$ to $O(n^{-3})$, as well as the exact expression of the mean.
These results allow the posterior distribution of mutual information to be approx- 
imated both from complete and incomplete data. As an example, this paper has 
shown that good approximations can be obtained by fitting common curves with 
the mentioned mean and variance. To our knowledge, this is the first work that 
addresses the analytical approximation of the distribution of mutual information. 
Analytical approximations are important because their implementation is shown to 
lead to computations of the same order of complexity as needed for the empirical 
mutual information. This makes the inductive approach a serious competitor of the 
descriptive use of mutual information for many applications. 

In fact, many applications are based on descriptive mutual information. We
have discussed how many of these could benefit from moving to the inductive side,
and in particular we have shown how this can be done for feature selection. In this
context, we have proposed the new filter FF, which is shown to be more effective
for sequential learning tasks than the traditional filter based on empirical mutual
information.